Abstract


AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear
combination. The linear combination may be viewed as a hyperplane in feature space where the
base hypotheses form the features. It has been observed that the generalization error of the
algorithm continues to improve even after all examples are on the correct side of the current
hyperplane. The improvement is attributed to the experimental observation that the distances
(margins) of the examples to the separating hyperplane keep increasing even after all examples are
on the correct side.

We introduce a new version of AdaBoost, called AdaBoost$^*_\nu$, that explicitly maximizes the
minimum margin of the examples up to a given precision. The algorithm incorporates a current
estimate of the achievable margin into its calculation of the linear coefficients of the base
hypotheses. The bound on the number of iterations needed by the new algorithm is the same as the
number needed by a known version of AdaBoost that must have an explicit estimate of the achievable
margin as a parameter. We also illustrate experimentally that our algorithm requires considerably
fewer iterations than other algorithms that aim to maximize the margin.


1. Introduction 


Boosting algorithms are greedy methods for forming linear combinations of base hypotheses. In the 
most common scenario the algorithm is given a fixed set of labeled training examples and in each 
iteration updates a distribution on these examples (i.e. a set of non-negative weights that sum to 
one). It then is given a base hypothesis whose weighted error (probability of wrong classification) 
is slightly below 50%. This base hypothesis is used to update the distribution on the examples: 
The algorithm increases the weights of those examples that were wrongly classified by the base 
hypothesis. At the end of each stage the base hypothesis is added to the linear combination, and the 
sign of this linear combination forms the current hypothesis of the boosting algorithm.

The most well known boosting algorithm is AdaBoost (Freund and Schapire, 1997). It is "adaptive"
in that the linear coefficient of the base hypothesis depends on the weighted error of the base
hypothesis at the time when the base hypothesis was added to the linear combination. AdaBoost
has two interesting properties. First, along with earlier boosting algorithms (Schapire, 1992;
Freund, 1995), its training error has the following exponential convergence property: if the
weighted training error of the $t$-th base hypothesis is $\epsilon_t = \frac{1}{2} - \frac{1}{2}\gamma_t$,
then an upper bound on the training error of the signed linear combination is reduced by a factor
of $1 - \frac{1}{2}\gamma_t^2$ at stage $t$. Second, it has been observed experimentally that
AdaBoost continues to "learn" even after the training error of the signed linear combination is
zero (Schapire et al., 1998). That is, in experiments the generalization error continues to
improve. The signed linear combination can be viewed as a homogeneous hyperplane in a feature
space, where each base hypothesis represents one feature or dimension. We define the margin of an
example as a signed distance to the hyperplane times its $\pm 1$ label (see Section 2 and
Appendix A for precise definitions). As soon as the training error is zero, the examples are on
the right side and all have positive margin. It has also been observed that the margins of the
examples continue to increase even after the training error is zero. There are theoretical bounds
on the generalization error of linear classifiers (e.g. Schapire et al., 1998; Breiman, 1999;
Koltchinskii et al., 2001) that improve with the margin of the classifier, which is defined as the
size of the minimum margin of the examples. Thus the fact that the margins improve experimentally
seems to explain why AdaBoost still learns after the training error is zero.





There is one flaw in this argument: AdaBoost has not been proven to maximize the margin of the
final hypothesis. We demonstrate this experimentally in Section 5. Moreover, Rudin et al. (2004a,
2005) recently showed that there are cases where AdaBoost provably does not maximize the margin.
Breiman (1999) proposed a modified algorithm, called Arc-GV (Arcing-Game Value), suitable
for this task and showed that it asymptotically maximizes the margin. Similar results are shown in
Grove and Schuurmans (1998) and Bennett et al. (2000). In this paper we present an algorithm that
produces a final hypothesis with margin at least $\rho^* - \nu$, where $\rho^*$ is the unknown
maximum margin achievable by any convex combination of base hypotheses and $\nu$ is a precision
parameter.

If we know $\rho^*$, then a linear combination with margin at least $\rho^* - \nu$ can be found by
a parameterized version of AdaBoost called AdaBoost$_\rho$ (cf. Rätsch et al. (2001); Rätsch and
Warmuth (2002)): When the parameter $\rho$ of AdaBoost$_\rho$ is set to $\rho^* - \nu$, then after
$\frac{2 \ln N}{\nu^2}$ iterations, where $N$ is the number of examples, the margin of the produced
linear combination is guaranteed to be at least $\rho^* - \nu$. The case when $\rho^*$ is not known
is more difficult. In a preliminary conference paper (Rätsch and Warmuth, 2002) we used
AdaBoost$_\rho$ iteratively in a binary-search-like fashion: $\log_2(2/\nu)$ calls to
AdaBoost$_\rho$ are guaranteed to produce a margin at least $\rho^* - \nu$. All but the last call
to AdaBoost$_\rho$ are used to find a suitable value of the parameter $\rho$, and in the last call
this parameter is used to create the final linear combination in at most $\frac{2 \ln N}{\nu^2}$
iterations.

In this paper we greatly simplify our answer for the case when $\rho^*$ is unknown. We have a
new one-pass algorithm called AdaBoost$^*_\nu$ that produces a linear combination with margin at
least $\rho^* - \nu$ after $\frac{2 \ln N}{\nu^2}$ iterations. Note that this is the same guarantee
we had on the number of iterations of AdaBoost$_\rho$ when it used the theoretically optimal
parameter $\rho = \rho^* - \nu$. The new algorithm AdaBoost$^*_\nu$ uses the parameter $\nu$ and a
current estimate of the achievable margin in the computation of the linear coefficients of the
base learners.

Except for the algorithm presented in the previous conference paper, this is the first result on
the fast convergence of a boosting algorithm to the maximum margin solution that works for all
$\rho^* \in [-1, 1]$. Using previous results one can only show that AdaBoost asymptotically
converges to a final hypothesis with margin at least $\rho^*/2$ if $\rho^* > 0$ and if subtle
conditions on the chosen base hypotheses are satisfied (cf. Corollary 5).





Recently other versions of AdaBoost have been published that are guaranteed to produce a linear
combination of margin at least $\rho^* - \nu$ after $\Omega(\nu^{-3})$ iterations (Rudin et al.,
2004c,b). Even though these algorithms have weaker iteration bounds than AdaBoost$^*_\nu$, they
were reported to perform better experimentally (Rudin et al., 2004c,a). We briefly compare
AdaBoost$^*_\nu$ to these more recent algorithms and show that the better empirical performance
was due to the wrong choice of $\nu$.












































The original AdaBoost was designed to find a final hypothesis of margin at least zero. Our
algorithm maximizes the margin for all values of $\rho^*$. This includes the inseparable case
(i.e. $\rho^* < 0$), where one minimizes the overlap between the two classes. In this case
AdaBoost runs forever without necessarily increasing the margin. Our algorithm is also useful when
the base hypotheses given to the Boosting algorithm are strong in the sense that they already
separate the data and have margin greater than zero, but less than one. In this case
$0 < \rho^* < 1$ and AdaBoost aborts immediately because the linear coefficients of such
hypotheses become unbounded. In contrast, our new algorithm also maximizes the margin when
presented with strong learners.

The big advantage of this algorithm is an absolute bound on the number of iterations: After
$\frac{2 \ln N}{\nu^2}$ iterations AdaBoost$^*_\nu$ is guaranteed to produce a hypothesis with
margin at least $\rho^* - \nu$. Our algorithm is applicable in sophisticated settings where the
number of hypotheses may be infinite. In Appendix B we use AdaBoost$^*_\nu$ to learn a convex
combination of support vector kernels and show that the same guarantees hold on the number of
iterations of the algorithm.
































The paper is structured as follows: Section 2 introduces some basic notation and in Section 3 we
first describe AdaBoost$_\rho$, which requires a lower bound $\rho$ of the maximum margin $\rho^*$
as a parameter. Then we present our new algorithm AdaBoost$^*_\nu$, which is similar to
AdaBoost$_\rho$, but continuously adapts $\rho$ based on a precision parameter $\nu$. Up to this
point we stay at a high level of presentation with the goal of making our algorithms accessible to
the quick reader. In Section 4 we introduce more notation and give a detailed analysis of both
algorithms. First, we prove that if the weighted training error of the $t$-th base hypothesis is
$\epsilon_t = \frac{1}{2} - \frac{1}{2}\gamma_t$, then an upper bound on the fraction of examples
with margin smaller than $\rho$ is reduced by a factor of $1 - \frac{1}{2}(\rho - \gamma_t)^2$ at
stage $t$ of AdaBoost$_\rho$ (cf. Section 4.2). (A slightly improved factor is shown for the case
when $\rho > 0$.) However, to achieve a large margin one needs to assume that the guess $\rho$ is
smaller than $\rho^*$. For the latter case we prove an exponential convergence rate of
AdaBoost$_\rho$. Then we discuss a method for automatically tuning $\rho$ depending on the errors
of the base hypotheses and a precision parameter $\nu$. We show that after roughly
$\frac{2 \ln N}{\nu^2}$ iterations our new one-pass algorithm AdaBoost$^*_\nu$ is guaranteed to
produce a linear combination with margin at least $\rho^* - \nu$. This strengthens the results of
our preliminary conference paper (Rätsch and Warmuth, 2002), which had an additional
$\log_2(2/\nu)$ factor in the total number of times the weak learner is called and much higher
constants. In Section 5 we compare the algorithms experimentally and discuss heuristics for tuning
$\nu$ in Section 5.2. Finally we briefly summarize and discuss our results in the Conclusion
section.















































2. Preliminaries and Basic Notation 


We consider the standard two-class supervised machine learning problem: Given a set of $N$ i.i.d.
training examples $(x_n, y_n)$, $n = 1, \dots, N$, with $x_n \in X$ and $y_n \in Y := \{-1, +1\}$,
we would like to learn a function $f : X \to Y$ that is able to generalize well on unseen data
generated from the same distribution as the training data.

In the case of ensemble learning (like boosting), there is a fixed underlying set of base
hypotheses $H := \{h \mid h : X \to [-1, 1]\}$ from which the ensemble is built. For now we
assume that $H$ is finite, but we will show in Section 4.5 that this assumption can be dropped in
most cases and that all of the following analysis also applies to the case of infinite hypothesis
sets.

Boosting algorithms iteratively form non-negative linear combinations of hypotheses from $H$.
In each iteration $t$, a base hypothesis $h_t \in H$ with a non-negative coefficient $\alpha_t$ is
added to the linear combination. We denote the combined hypothesis as follows (note that we
normalized the weights):

$$\tilde{f}_\alpha(x) = \operatorname{sign} f_\alpha(x), \quad \text{where} \quad
f_\alpha(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T} \alpha_r}\, h_t(x), \quad
h_t \in H, \ \text{and} \ \alpha_t \ge 0 .$$

The "black box" that selects the base hypothesis in each iteration is called the weak learner. For
AdaBoost, it has been shown that if the weak learner is guaranteed to select base hypotheses of
weighted error slightly below 50%, then the combined hypothesis is consistent with the training set
in a small number of iterations (Freund and Schapire, 1997). We will discuss bounds on the number
of iterations in detail in Section 4. Since at most one new base hypothesis is added in each
iteration, the size of the final hypothesis is bounded by the number of iterations. These bounds
are important because the sample size bounds provable in the PAC model grow with the size of the
final hypothesis (Schapire, 1992; Freund, 1995).

In more recent research (Schapire et al., 1998) it was also shown that a bound on the
generalization error decreases with the size of the margin of the final hypothesis $f$. The margin
of a single example $(x_n, y_n)$ w.r.t. $f$ is defined as $y_n f_\alpha(x_n)$. Thus the margin
quantifies by how far this example is on the $y_n$ side of the hyperplane $f$. In Appendix A we
clarify how the margin of an example is related to its $\ell_\infty$-distance to the hyperplane
with normal $\alpha$. The margin of the combined hypothesis $f$ is the minimum margin of all $N$
examples. The goal of this paper is to find a small non-negative linear combination of base
hypotheses from $H$ with margin close to the maximum achievable margin.

The following table gives some of the main notations that will be used throughout this paper:










































































Symbol              Description

$n, N$              index and number of examples
$m, M$              index and number of hypotheses if finite
$t, T$              index and number of iterations
$X$                 input space
$Y$                 label space $\{\pm 1\}$
$(x, y)$            an example and its label
$H, h_m$            set of base hypotheses and the $m$-th element
$\alpha$            hypothesis weight vector
$d$                 weighting on the training set
$I(\cdot)$          the indicator function: $I(\text{true}) = 1$ and $I(\text{false}) = 0$
$\rho$              the margin parameter of AdaBoost$_\rho$
$\{\rho_t\}$        the sequence of margin parameters of AdaBoost$_{\{\rho_t\}}$
$\rho^*$            the maximum margin
$\hat{\rho}_t$      margin in the $t$-th iteration
$\nu$               the accuracy parameter of AdaBoost$^*_\nu$
$\epsilon$          weighted classification error
$\gamma^*$          the minimum edge
$\gamma$            an arbitrary edge threshold
$\gamma_t$          the edge of the $t$-th hypothesis





3. AdaBoost$_\rho$ and AdaBoost$^*_\nu$


The original AdaBoost was designed to find a consistent hypothesis $\tilde{f}$, which is defined as
a signed linear combination $f$ with margin greater than zero. We start with a slight modification
of AdaBoost, which finds (if possible) a linear combination of base learners with margin $\rho$,
where $\rho$ is a parameter (cf. Algorithm 1).¹ We call this algorithm AdaBoost$_\rho$, as it
naturally generalizes AdaBoost for the case when the target margin is $\rho$. The original AdaBoost
algorithm now becomes AdaBoost$_0$.

















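Algorithm 1: The AdaBoost$_\rho$ algorithm, with margin parameter $\rho$

1. Input: $S = ((x_1, y_1), \dots, (x_N, y_N))$, No. of Iterations $T$, margin target $\rho$
2. Initialize: $d_n^1 = 1/N$ for all $n = 1, \dots, N$
3. Do for $t = 1, \dots, T$:
   (a) Train classifier on $\{S, d^t\}$ and obtain hypothesis $h_t : x \mapsto [-1, 1]$
   (b) Calculate the edge $\gamma_t$ of $h_t$: $\gamma_t = \sum_{n=1}^{N} d_n^t y_n h_t(x_n)$
   (c) if $|\gamma_t| = 1$, then $\alpha_1 = \operatorname{sign}(\gamma_t)$, $h_1 = h_t$, $T = 1$; break
   (d) Set $\alpha_t = \frac{1}{2} \ln \frac{1 + \gamma_t}{1 - \gamma_t} - \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$
   (e) Update weights: $d_n^{t+1} = d_n^t \exp(-\alpha_t y_n h_t(x_n)) / Z_t$,
       where $Z_t = \sum_{n=1}^{N} d_n^t \exp(-\alpha_t y_n h_t(x_n))$
4. Output: $f_\alpha(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T} \alpha_r}\, h_t(x)$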




















The algorithm AdaBoost$_\rho$ was already known as unnormalized Arcing (Breiman, 1999) or
AdaBoost-type Algorithm (Rätsch et al., 2001). Moreover, it is related to algorithms proposed in
Freund and Schapire (1999) and Zhang (2002). The only difference from AdaBoost is the choice of
the hypothesis coefficients $\alpha_t$: An additional term $-\frac{1}{2} \ln \frac{1+\rho}{1-\rho}$
appears in the expression for the hypothesis coefficient $\alpha_t$. This term vanishes when
$\rho = 0$. The parameter $\rho$ can be seen as a guess of the maximum margin $\rho^*$. If $\rho$
is chosen properly (slightly below $\rho^*$), then AdaBoost$_\rho$ will converge exponentially fast
to a combined hypothesis with nearly the maximum margin. See Section 4.2 for details.
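
The steps of Algorithm 1 can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation. It assumes a finite hypothesis set encoded as a matrix `U` with `U[m][n] = y_n * h_m(x_n)`, and it uses a hypothetical weak learner that always returns the hypothesis with the largest edge. The names `adaboost_rho` and `min_margin` are made up for this example.

```python
import math

def adaboost_rho(U, rho, T):
    """Sketch of Algorithm 1 on a finite margin matrix U[m][n] = y_n * h_m(x_n).

    Returns one normalized coefficient per hypothesis. Assumes -1 < rho < 1
    and that every selected edge is larger than rho.
    """
    M, N = len(U), len(U[0])
    d = [1.0 / N] * N                                   # step 2: uniform weights
    alpha = [0.0] * M
    offset = 0.5 * math.log((1 + rho) / (1 - rho))
    for _ in range(T):
        # Steps (a)-(b): hypothetical weak learner, returns the largest edge.
        edges = [sum(d[n] * U[m][n] for n in range(N)) for m in range(M)]
        m = max(range(M), key=lambda j: edges[j])
        g = edges[m]
        if g >= 1.0:                                    # step (c): perfect hypothesis
            return [1.0 if j == m else 0.0 for j in range(M)]
        a = 0.5 * math.log((1 + g) / (1 - g)) - offset  # step (d)
        alpha[m] += a
        w = [d[n] * math.exp(-a * U[m][n]) for n in range(N)]  # step (e)
        z = sum(w)
        d = [wn / z for wn in w]
    total = sum(alpha)
    return [a / total for a in alpha]                   # step 4: normalize

def min_margin(U, alpha):
    """Smallest margin y_n * f_alpha(x_n) over all examples."""
    return min(sum(alpha[m] * U[m][n] for m in range(len(U)))
               for n in range(len(U[0])))

# Toy problem: each of three hypotheses errs on exactly one of three examples.
U = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]
alpha = adaboost_rho(U, rho=0.2, T=200)
margin = min_margin(U, alpha)
```

On this toy problem the maximum margin is 1/3 (attained by uniform coefficients), so the guess 0.2 lies below it and the returned combination should reach a margin of at least 0.2.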

































































1. The original AdaBoost algorithm was formulated in terms of weighted training error $\epsilon_t$
of a base hypothesis. Here we use an equivalent more convenient formulation in terms of the edge
$\gamma_t$, where $\epsilon_t = \frac{1}{2} - \frac{1}{2}\gamma_t$ (cf. Section 4.1).

The following example illustrates how AdaBoost$_\rho$ works. Assume the weak learner returns the
constant hypothesis $h_t(x) \equiv 1$. The weighted error of this hypothesis is the sum of the
weights of all negative examples, i.e. $\epsilon_t = \sum_{n : y_n = -1} d_n^t$, and its edge is
$\gamma_t = 1 - 2\epsilon_t$. The coefficient $\alpha_t$ is chosen so that the edge of $h_t$ with
respect to the new distribution is exactly $\rho$ (instead of 0 as for the original AdaBoost).
More precisely, the given choice of $\alpha_t$ assures that this edge is $\rho$ only for
$\pm 1$-valued base hypotheses.

For a more general base hypothesis $h_t$ with continuous range $[-1, +1]$, choosing $\alpha_t$ such
that $Z_t$ as a function of $\alpha_t$ is minimized, guarantees that the edge of $h_t$ with respect
to the distribution $d^{t+1}$ is $\rho$. See Schapire and Singer (1999) for a similar discussion.
Choosing $\alpha_t$ as in step 3 (d) approximately minimizes $Z_t$ when the range of $h_t$ is
$[-1, +1]$.
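
The claim about the constant hypothesis can be checked numerically. The following sketch uses made-up weights and labels, and the helper names are hypothetical. It applies the coefficient from step 3 (d) and the weight update from step 3 (e), then recomputes the edge under the new distribution.

```python
import math

def edge(d, y, h):
    """Edge of predictions h under distribution d: sum_n d_n * y_n * h_n."""
    return sum(dn * yn * hn for dn, yn, hn in zip(d, y, h))

def coefficient(gamma, rho):
    """Hypothesis coefficient from step 3 (d) of Algorithm 1."""
    return (0.5 * math.log((1 + gamma) / (1 - gamma))
            - 0.5 * math.log((1 + rho) / (1 - rho)))

def reweight(d, y, h, a):
    """Weight update from step 3 (e): exponential reweighting, then normalize."""
    w = [dn * math.exp(-a * yn * hn) for dn, yn, hn in zip(d, y, h)]
    z = sum(w)
    return [wn / z for wn in w]

y = [1, 1, 1, -1, -1]          # three positive and two negative examples
h = [1, 1, 1, 1, 1]            # constant hypothesis h(x) = 1
d = [0.2] * 5                  # current distribution (illustrative values)
rho = 0.1                      # target margin (illustrative value)

gamma = edge(d, y, h)          # edge before the update
d_new = reweight(d, y, h, coefficient(gamma, rho))
new_edge = edge(d_new, y, h)   # edge of the same hypothesis after the update
```

For any $\pm 1$-valued hypothesis the recomputed edge matches the target margin up to floating point error, which is the property described above.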

In Kivinen and Warmuth (1999) and Lafferty (1999), the standard boosting algorithms are
interpreted as approximate solutions to the following optimization problem: choose a distribution
$d$ of maximum entropy subject to the constraints that the edges of the previous hypotheses are
equal to zero. In this paper we use the inequality constraints that the edges of the previous
hypotheses are at most $\rho$. The $\alpha_t$'s function as Lagrange multipliers for these
inequality constraints. Since $g(x) = \frac{1}{2} \ln \frac{1+x}{1-x}$ is an increasing function,

$$\alpha_t = \frac{1}{2} \ln \frac{1 + \gamma_t}{1 - \gamma_t} - \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho} \ge 0
\quad \text{iff} \quad \gamma_t \ge \rho. \qquad (1)$$

Notice that when $\rho = 0$, adding $h_t$ or $-h_t$ leads to the same distribution $d^{t+1}$. This
symmetry is broken for $\rho \ne 0$.

Since one does not know the value of the optimum margin $\rho^*$ beforehand, one also needs to
find $\rho^*$. In Rätsch and Warmuth (2002) we presented the Marginal AdaBoost algorithm, which
constructs a sequence $\{\rho_r\}_{r=1}^{R}$ converging to $\rho^*$. A fast way to find a real
value up to a certain accuracy $\nu$ in the interval $[-1, 1]$ is a binary search, since one needs
only $\log_2(2/\nu)$ steps.² Thus the previous Marginal AdaBoost algorithm uses AdaBoost$_{\rho_r}$
(Algorithm 1) to decide whether the current guess $\rho_r$ is larger or smaller than $\rho^*$.
Depending on the outcome, $\rho_r$ can be chosen so that the region of uncertainty for $\rho^*$ is
roughly cut in half. However, in the previous algorithm all but the last of the $\log_2(2/\nu)$
calls to AdaBoost$_\rho$ are used only to find a suitable value of the parameter $\rho$.

In this paper we propose a different algorithm, called AdaBoost$^*_\nu$. Here $\nu > 0$ is a
precision parameter. The algorithm finds a non-negative linear combination with margin at least
$\rho^* - \nu$. Like Arc-GV (Breiman, 1999), the new algorithm essentially runs AdaBoost$_\rho$
once but instead of using a fixed margin estimate $\rho$, it updates $\rho$ in an appropriate way.
We shall show iteration bounds for our algorithm AdaBoost$^*_\nu$, which are not known for Arc-GV.
The latter algorithm produces an (essentially³) monotonically increasing sequence of margin
estimates, while in AdaBoost$^*_\nu$ we use a monotonically decreasing sequence. The improved
sequence of estimates is based on two new theoretical insights, which will be developed in the
next section.

We will show that the number of iterations required by the new one-pass AdaBoost$^*_\nu$ algorithm
(see Algorithm 2 for pseudo-code) is at most $\frac{2 \ln N}{\nu^2}$. This equals the iteration
bound for the best algorithm we know of for the case when $\rho^*$ is known and we seek a linear
combination of margin at least $\rho^* - \nu$: AdaBoost$_\rho$ with parameter
$\rho = \rho^* - \nu$. The iteration bound for the new algorithm is the same as the iteration
bound for the last call to AdaBoost$_\rho$ of the previous Marginal AdaBoost algorithm.



























































2. If one knows that $\rho^* \in [a, b]$, one needs only $\log_2((b - a)/\nu)$ steps.
3. In the original formulation the sequence was not necessarily increasing, but Rätsch (2001)
showed that it leads to the same result and easier proofs if one restricts it to be monotonically
increasing.

Algorithm 2: The AdaBoost$^*_\nu$ algorithm, with accuracy parameter $\nu$

1. Input: $S = ((x_1, y_1), \dots, (x_N, y_N))$, No. of Iterations $T$, desired accuracy $\nu$
2. Initialize: $d_n^1 = 1/N$ for all $n = 1, \dots, N$
3. Do for $t = 1, \dots, T$:
   (a) Train classifier on $\{S, d^t\}$ and obtain hypothesis $h_t : x \mapsto [-1, 1]$
   (b) Calculate the edge $\gamma_t$ of $h_t$: $\gamma_t = \sum_{n=1}^{N} d_n^t y_n h_t(x_n)$
   (c) if $|\gamma_t| = 1$, then $\alpha_1 = \operatorname{sign}(\gamma_t)$, $h_1 = h_t$, $T = 1$; break
   (d) $\gamma_t^{\min} = \min_{r=1,\dots,t} \gamma_r$; $\rho_t = \gamma_t^{\min} - \nu$
   (e) Set $\alpha_t = \frac{1}{2} \ln \frac{1 + \gamma_t}{1 - \gamma_t} - \frac{1}{2} \ln \frac{1 + \rho_t}{1 - \rho_t}$
   (f) Update weights: $d_n^{t+1} = d_n^t \exp(-\alpha_t y_n h_t(x_n)) / Z_t$,
       where $Z_t = \sum_{n=1}^{N} d_n^t \exp(-\alpha_t y_n h_t(x_n))$
4. Output: $f_\alpha(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T} \alpha_r}\, h_t(x)$
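
Algorithm 2 can be sketched in the same way. The sketch below is illustrative only and rests on stated assumptions. The margin estimate is taken to be the smallest edge seen so far minus $\nu$, i.e. $\rho_t = \min_{r \le t} \gamma_r - \nu$, which yields the monotonically decreasing sequence described in the text. The hypothesis set is a finite matrix `U[m][n] = y_n * h_m(x_n)`, and the weak learner is a hypothetical one that returns the largest edge. All function names are made up for this example.

```python
import math

def adaboost_star(U, nu, T):
    """Sketch of Algorithm 2 on a finite margin matrix U[m][n] = y_n * h_m(x_n).

    Assumes the margin estimate rho_t = (smallest edge so far) - nu stays
    inside (-1, 1). Returns one normalized coefficient per hypothesis.
    """
    M, N = len(U), len(U[0])
    d = [1.0 / N] * N
    alpha = [0.0] * M
    gamma_min = 1.0
    for _ in range(T):
        # Hypothetical weak learner: pick the hypothesis with the largest edge.
        edges = [sum(d[n] * U[m][n] for n in range(N)) for m in range(M)]
        m = max(range(M), key=lambda j: edges[j])
        g = edges[m]
        if g >= 1.0:                         # a single hypothesis is perfect
            return [1.0 if j == m else 0.0 for j in range(M)]
        gamma_min = min(gamma_min, g)        # assumed update of the estimate
        rho_t = gamma_min - nu
        a = (0.5 * math.log((1 + g) / (1 - g))
             - 0.5 * math.log((1 + rho_t) / (1 - rho_t)))
        alpha[m] += a
        w = [d[n] * math.exp(-a * U[m][n]) for n in range(N)]
        z = sum(w)
        d = [wn / z for wn in w]
    total = sum(alpha)
    return [a / total for a in alpha]

def min_margin(U, alpha):
    """Smallest margin y_n * f_alpha(x_n) over all examples."""
    return min(sum(alpha[m] * U[m][n] for m in range(len(U)))
               for n in range(len(U[0])))

# Toy problem with maximum margin 1/3: each hypothesis errs on one example.
U = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]
nu = 0.1
alpha = adaboost_star(U, nu=nu, T=400)
margin = min_margin(U, alpha)
```

No guess of the maximum margin is passed in. Only the precision `nu` is supplied, and the achieved margin on this toy problem should come within `nu` of the optimum 1/3.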





4. Detailed Analysis 


In this section we are going to analyze the algorithms in detail. We start by showing the
relationship between optimal edges and margins, prove and illustrate the convergence properties of
AdaBoost$_\rho$ and finally prove the convergence of AdaBoost$^*_\nu$.


4.1 Weak learning and margins 


The standard assumption made on the weak learning algorithm for the PAC analysis of Boosting
algorithms is that the weak learner returns a hypothesis $h$ from a fixed set $H$ that is slightly
better than random guessing. That is, the error rate $\epsilon$ is consistently smaller than
$\frac{1}{2}$. Note that the error rate of $\frac{1}{2}$ could easily be reached by a fair coin,
assuming both classes have the same prior probabilities. More formally, the error $\epsilon$ of a
$\pm 1$-valued hypothesis is defined as the fraction of examples that are misclassified. In
Boosting this is extended to weighted example sets and the error is defined as

$$\epsilon_h(d) = \sum_{n=1}^{N} d_n\, I(y_n \ne h(x_n)),$$

where $h$ is the hypothesis returned by the weak learner and $I$ is the indicator function with
$I(\text{true}) = 1$ and $I(\text{false}) = 0$. The distribution $d = (d_1, \dots, d_N)$ of the
examples is such that $d_n \ge 0$ and $\sum_{n=1}^{N} d_n = 1$.

When the range of a hypothesis $h$ is the entire interval $[-1, +1]$, then the edge
$\gamma_h(d) = \sum_{n=1}^{N} d_n y_n h(x_n)$ is a more convenient quantity for measuring the
quality of $h$. This edge is an affine transformation of the error for the case when $h$ has range
$\pm 1$: $\epsilon_h(d) = \frac{1}{2} - \frac{1}{2}\gamma_h(d)$ and $\epsilon_h(d) \le \frac{1}{2}$
iff $\gamma_h(d) \ge 0$.
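
The relation between weighted error and edge is easy to verify on a small example. The weights, labels and predictions below are made up for illustration, and the function names are hypothetical.

```python
def weighted_error(d, y, h):
    """Weighted classification error: total weight of misclassified examples."""
    return sum(dn for dn, yn, hn in zip(d, y, h) if yn != hn)

def edge(d, y, h):
    """Edge: sum_n d_n * y_n * h(x_n); also defined for h with values in [-1, 1]."""
    return sum(dn * yn * hn for dn, yn, hn in zip(d, y, h))

d = [0.1, 0.2, 0.3, 0.4]       # a distribution over four examples
y = [1, -1, 1, -1]             # labels
h = [1, 1, 1, -1]              # +/-1 predictions, wrong only on the second example

eps = weighted_error(d, y, h)
gamma = edge(d, y, h)
```

Here the error is the weight of the single misclassified example, and it equals one half minus half the edge. The error is below one half exactly when the edge is positive.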





Recall from Section 2 that the margin of a given example $(x_n, y_n)$ is defined as
$y_n f_\alpha(x_n)$. Also recall that $H$ is the set from which the weak learner chooses its base
hypotheses. Assume for a moment that $H$ is finite. If we combine all hypotheses from $H$, then
the following well-known theorem establishes the connection between margins and edges (first seen
in connection with Boosting in Freund and Schapire, 1996; Breiman, 1999):⁴

Theorem 1 (Min-Max-Theorem, von Neumann (1928))

$$\gamma^* := \min_{d} \max_{m=1,\dots,M} \sum_{n=1}^{N} d_n y_n h_m(x_n)
= \max_{\alpha} \min_{n=1,\dots,N} y_n \sum_{m=1}^{M} \alpha_m h_m(x_n) =: \rho^*, \qquad (2)$$

where $d \in P^N$, $\alpha \in P^M$ and $M = |H|$. Here $P^k$ denotes the $k$-dimensional
probability simplex.

Thus, the minimum edge $\gamma^*$ that can be achieved over all possible distributions $d$ of the
training set is equal to the maximum margin $\rho^*$ of any linear combination of hypotheses from
$H$. Also, for any non-optimal distribution $d$ and hypothesis weights $\alpha$ we always have

$$\max_{h \in H} \gamma_h(d) > \gamma^* = \rho^* > \min_{n=1,\dots,N} y_n f_\alpha(x_n).$$

In particular, if the weak learning algorithm is guaranteed to return a hypothesis with edge at
least $\gamma$ for any distribution on the examples, then $\gamma^* \ge \gamma$ and by the above
duality there exists a combined hypothesis with margin at least $\gamma$. If $\gamma$ is equal to
its upper bound $\gamma^*$ then there exists a combined hypothesis with margin exactly
$\gamma = \rho^*$ that only uses hypotheses that are actually returned by the weak learner in
response to certain distributions on the examples.
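
The duality between edges and margins can be illustrated on a tiny game. The sketch below uses a made-up 3-by-3 matrix `U[m][n] = y_n * h_m(x_n)` whose game value is 1/3, and hypothetical helper names. It shows that the largest edge under any distribution is never smaller than the smallest margin of any convex combination, and that both quantities meet at the value of the game.

```python
def max_edge(U, d):
    """Largest edge over all hypotheses for the distribution d on examples."""
    return max(sum(d[n] * row[n] for n in range(len(d))) for row in U)

def min_margin(U, alpha):
    """Smallest margin over all examples for hypothesis weights alpha."""
    return min(sum(alpha[m] * U[m][n] for m in range(len(U)))
               for n in range(len(U[0])))

# Each hypothesis (row) errs on exactly one example (column).
U = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]

uniform = [1.0 / 3.0] * 3
d = [0.5, 0.3, 0.2]            # an arbitrary, non-optimal distribution
alpha = [0.6, 0.3, 0.1]        # arbitrary, non-optimal hypothesis weights

upper = max_edge(U, d)         # upper bound on the game value
lower = min_margin(U, alpha)   # lower bound on the game value
value_from_edges = max_edge(U, uniform)
value_from_margins = min_margin(U, uniform)
```

For this matrix the uniform distribution and the uniform weights are both optimal, so the two value computations coincide, while the arbitrary choices bracket that value from above and below.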


From this discussion we can derive a sufficient condition on the weak learning algorithm to reach
the maximum margin (for the case when $H$ is finite). If the weak learner returns hypotheses whose
edges are at least $\gamma^*$, then there exists a linear combination of these hypotheses that
attains a margin $\gamma^* = \rho^*$. We will prove later that our AdaBoost$^*_\nu$ algorithm
efficiently finds a linear combination with margin close to $\rho^*$ (cf. Theorem 6).














Constraining the edges of the previous hypotheses to equal zero (as done in the totally corrective
algorithm of Kivinen and Warmuth (1999)) leads to a problem if there is no solution satisfying
these constraints. At the end of trial $t$, the set of previous hypotheses is
$H_t = \{h_1, \dots, h_t\}$ and the totally corrective algorithm finds a distribution such that
$\gamma_h(d) = 0$ for all $h \in H_t$. Because of the above duality and the fact that
$H_t \subseteq H$,

$$\gamma_t^* := \min_{d} \max_{h \in H_t} \gamma_h(d) \le \gamma^* = \rho^*.$$

The non-decreasing sequence $(\gamma_t^*)$ converges to $\rho^*$ from below. If $\rho^* > 0$, then
the equality constraints on the edges are not satisfiable as soon as $\gamma_t^* > 0$.


In contrast our new algorithm AdaBoost$^*_\nu$ is motivated by a system of inequality constraints
$\gamma_h(d) \le \rho$, for $h \in H_t$, where $\rho$ is adapted. Again, if $\rho < \rho^*$, then
the system of inequalities with this $\rho$ may not have a solution (and the Lagrange multipliers
may diverge to infinity). In AdaBoost$^*_\nu$ we start with $\rho$ large and decrease it when
necessary. As we shall see, the algorithm maintains a margin parameter $\rho$ that is always at
least $\rho^* - \nu$.

4. This is a zero-sum game with payoff matrix $y_n h_m(x_n)$. The row player finds a mixture $d$
over the rows/examples and the column player a mixture $\alpha$ over the columns/hypotheses.
Adding a row/example makes the minimax value of the game go down and adding a column/hypothesis
makes it go up.


4.2 Convergence properties of AdaBoost$_\rho$


Let AdaBoost$_{\{\rho_t\}}$ denote the version of AdaBoost$_\rho$ that uses a time-varying margin
parameter $\rho_t$ at iteration $t$. Thus in step 3 (d) of the algorithm, $\rho$ is replaced by
$\rho_t$. This extension will be necessary for the later analysis of AdaBoost$^*_\nu$. The
sequence $\{\rho_t\}_{t=1}^{T}$ might be specified while running the algorithm. For instance, in
the algorithm Arc-GV, Breiman (1999) chooses $\rho_t$ as
$\min_{n=1,\dots,N} y_n f_{\alpha^{t-1}}(x_n)$.

Breiman (1999) showed that Arc-GV asymptotically converges to the maximum margin (see discussion
in next section). In the following we answer the question of how to best choose the sequence
$\{\rho_t\}$ so as to optimize bounds on the fraction of examples which have a margin at most
$\rho$.
































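Lemma 2 For any $\rho \in [-1, 1]$, the final hypothesis $f_\alpha$ of AdaBoost$_{\{\rho_t\}}$
satisfies the following inequality:

$$\frac{1}{N} \sum_{n=1}^{N} I(y_n f_\alpha(x_n) \le \rho)
\le \left( \prod_{t=1}^{T} Z_t \right) \exp\left\{ \sum_{t=1}^{T} \rho \alpha_t \right\}
= \prod_{t=1}^{T} \exp\{ \rho \alpha_t + \ln Z_t \}, \qquad (3)$$

where $Z_t = \sum_{n=1}^{N} d_n^t \exp(-\alpha_t y_n h_t(x_n))$ and
$\alpha_t = \frac{1}{2} \ln \frac{1 + \gamma_t}{1 - \gamma_t} - \frac{1}{2} \ln \frac{1 + \rho_t}{1 - \rho_t}$.

The proof directly follows from a simple extension of Theorem 1 in Schapire and Singer (1999)
(see also Schapire et al. (1998)).

We now use a lemma from Rätsch et al. (2001) to upper bound the right hand side (rhs) of the
above inequality: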
























































Lemma 3 Let $\gamma_t$ be the edge of $h_t$ in the $t$-th iteration of AdaBoost$_{\{\rho_t\}}$.
Assume $-1 \le \rho_t \le \gamma_t$. Then for all $\rho \in [-1, 1]$,

$$\exp\{ \rho \alpha_t + \ln Z_t \} \le
\exp\left( \frac{1 + \rho}{2} \ln \frac{1 + \gamma_t}{1 + \rho_t}
+ \frac{1 - \rho}{2} \ln \frac{1 - \gamma_t}{1 - \rho_t} \right). \qquad (4)$$

Note that this generalizes Theorem 5 of Freund and Schapire (1997) to the case when the target
margin is not zero.

AdaBoost$_{\{\rho_t\}}$ makes progress if the rhs of (4) is smaller than one. Suppose we would
like to reach a margin $\rho$ on all training examples, where we obviously need to assume
$\rho \le \rho^*$. We can then ask which sequence $\{\rho_t\}_{t=1}^{T}$ one should use to find
such a combined hypothesis in as few iterations as possible. The rhs of (4) can be rewritten as


















































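$$\exp\left( \Delta_2(\rho, \rho_t) - \Delta_2(\rho, \gamma_t) \right),$$

where $\Delta_2(a, b) := \frac{1 + a}{2} \ln \frac{1 + a}{1 + b} + \frac{1 - a}{2} \ln \frac{1 - a}{1 - b}$
denotes the binary relative entropy between $a, b \in [-1, 1]$. Therefore the rhs of (4) is
minimized for $\rho_t = \rho$ (independent of $\gamma_t$) and one should always use this
constant choice.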
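
The rewriting in terms of the binary relative entropy can be checked numerically. For a $\pm 1$-valued hypothesis with edge $\gamma_t$ the normalization $Z_t$ has a closed form, because the correctly classified examples carry total weight $(1 + \gamma_t)/2$. The sketch below uses that fact, with illustrative numbers and hypothetical function names, to compare $\rho \alpha_t + \ln Z_t$ with $\Delta_2(\rho, \rho_t) - \Delta_2(\rho, \gamma_t)$ for several choices of $\rho_t$.

```python
import math

def delta2(a, b):
    """Binary relative entropy between a and b, both strictly inside (-1, 1)."""
    return ((1 + a) / 2 * math.log((1 + a) / (1 + b))
            + (1 - a) / 2 * math.log((1 - a) / (1 - b)))

def log_factor(rho, rho_t, gamma):
    """rho * alpha_t + ln Z_t for a +/-1-valued hypothesis with edge gamma."""
    a = (0.5 * math.log((1 + gamma) / (1 - gamma))
         - 0.5 * math.log((1 + rho_t) / (1 - rho_t)))
    # Correct examples have total weight (1 + gamma) / 2, wrong ones the rest.
    z = (1 + gamma) / 2 * math.exp(-a) + (1 - gamma) / 2 * math.exp(a)
    return rho * a + math.log(z)

rho, gamma = 0.1, 0.5                      # target margin and edge (made up)
candidates = (0.0, 0.1, 0.2, 0.3)          # possible choices of rho_t
values = {r: log_factor(rho, r, gamma) for r in candidates}
best = min(values, key=values.get)         # choice giving the smallest factor
```

Among the candidates, the smallest factor is obtained when the margin parameter equals the target margin, in line with the statement above.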

This means that when $\rho_t = \rho$, the rhs of (4) equals $\exp(-\Delta_2(\rho, \gamma_t))$, so
the bound (3) is reduced by this factor in iteration $t$. The factor can be upper bounded by
inspecting the Taylor expansion at $\gamma_t = \rho$ and noticing that when
$0 \le \rho \le \gamma_t$, all terms of order three and higher are negative:

$$\exp(-\Delta_2(\rho, \gamma_t)) \le 1 - \frac{1}{2} \frac{(\rho - \gamma_t)^2}{1 - \rho^2},
\quad \text{for } 0 \le \rho \le \gamma_t. \qquad (5)$$

The denominator $1 - \rho^2$ speeds up the convergence when $\rho \gg 0$. Notice that when
$\rho = 0$, then we recover the original AdaBoost bound.

Now we determine an upper bound on the number of iterations needed by AdaBoost$_\rho$ for
achieving a margin of $\rho$ on all examples, given that the maximum margin is $\rho^*$:

Corollary 4 Assume the weak learner always returns a base hypothesis with an edge
$\gamma_t \ge \rho^*$. If $0 \le \rho \le \rho^* - \nu$, $\nu > 0$, then AdaBoost$_\rho$ will
converge to a solution with margin at least $\rho$ on all examples in at most
$\frac{2 \ln N\, (1 - \rho^2)}{\nu^2}$ iterations.








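Proof By Lemma 2 and (4), (5):

$$\frac{1}{N} \sum_{n=1}^{N} I(y_n f_\alpha(x_n) \le \rho)
\le \prod_{t=1}^{T} \left( 1 - \frac{1}{2} \frac{(\rho - \gamma_t)^2}{1 - \rho^2} \right)
\le \left( 1 - \frac{1}{2} \frac{\nu^2}{1 - \rho^2} \right)^{T},$$

where the last step uses $\gamma_t - \rho \ge \rho^* - \rho \ge \nu$. The margin is at least
$\rho$ for all examples, if the rhs is smaller than $\frac{1}{N}$; hence after at most

$$\frac{\ln N}{-\ln\left( 1 - \frac{1}{2} \frac{\nu^2}{1 - \rho^2} \right)}
\le \frac{2 \ln N\, (1 - \rho^2)}{\nu^2}$$

iterations, which proves the statement. □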

















When $\rho < 0$, then inequality (5) can be replaced with the following weaker inequality which
holds for all distinct $\rho, \gamma_t \in [-1, 1]$:

$$\exp(-\Delta_2(\rho, \gamma_t)) < \exp\left( -\frac{1}{2} (\rho - \gamma_t)^2 \right). \qquad (6)$$

This leads to the same bound as in the above corollary except that the factor $(1 - \rho^2)$ is
omitted. Thus when $\rho < 0$, the bound on the number of iterations becomes
$\frac{2 \ln N}{\nu^2}$ (Rätsch, 2001, page 25).
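
The two per-iteration bounds and the resulting iteration counts can be explored numerically. The sketch below is illustrative and its function names are hypothetical. It checks inequality (5) on a grid with $0 \le \rho \le \gamma_t$ and inequality (6) on a grid over $(-1, 1)$, using a small tolerance so that the boundary case $\rho = \gamma_t$ passes, and it evaluates the iteration bound with and without the factor $(1 - \rho^2)$.

```python
import math

def delta2(a, b):
    """Binary relative entropy between a and b, both strictly inside (-1, 1)."""
    return ((1 + a) / 2 * math.log((1 + a) / (1 + b))
            + (1 - a) / 2 * math.log((1 - a) / (1 - b)))

def iteration_bound(N, nu, rho=0.0):
    """2 ln N (1 - rho^2) / nu^2 for rho >= 0, and 2 ln N / nu^2 for rho < 0."""
    factor = 1.0 - rho ** 2 if rho >= 0 else 1.0
    return 2.0 * math.log(N) * factor / nu ** 2

grid = [i / 20 for i in range(-19, 20)]    # points strictly inside (-1, 1)
tol = 1e-12

# Inequality (5): only claimed for 0 <= rho <= gamma.
holds_5 = all(
    math.exp(-delta2(r, g)) <= 1 - 0.5 * (r - g) ** 2 / (1 - r ** 2) + tol
    for r in grid for g in grid if 0 <= r <= g
)

# Inequality (6): claimed for all pairs.
holds_6 = all(
    math.exp(-delta2(r, g)) <= math.exp(-0.5 * (r - g) ** 2) + tol
    for r in grid for g in grid
)

bound_plain = iteration_bound(N=100, nu=0.1)           # rho = 0
bound_large_rho = iteration_bound(N=100, nu=0.1, rho=0.5)
```

For 100 examples and precision 0.1 the bound is roughly 921 iterations, and a positive margin parameter lowers it through the factor $(1 - \rho^2)$.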























4.3 Asymptotic Margin of AdaBoost$_\rho$


With the methods shown so far we can determine the asymptotic value of the margin of the
hypothesis produced by the original AdaBoost algorithm. First, we state a lower bound on the
margin that is achieved by AdaBoost$_\rho$. There is a gap between this lower bound and the upper
bound of Theorem 1. In a second part we consider an experiment that shows that depending on some
subtle properties of the weak learner, the margin of combined hypotheses generated by AdaBoost can
converge to quite different values (while the maximum margin is kept constant). We observe that
the previously derived lower bound on the margin is quite tight in empirical cases.

As long as each factor in the rhs of Eq. (3) is smaller than 1, the bound decreases. If the factor
is at most $1 - \mu$ and $\mu > 0$, then the rhs converges exponentially fast to zero. The
following corollary considers the asymptotic case and gives a lower bound on the margin.




















Corollary 5 (Rätsch (2001)) Assume AdaBoost$_\rho$ generates hypotheses $h_1, h_2, \dots$ with
edges $\gamma_1, \gamma_2, \dots$ and coefficients $\alpha_1, \alpha_2, \dots$. Let
$\gamma^{\min} = \inf_{t=1,2,\dots} \gamma_t$ and assume $\gamma^{\min} > \rho$. Furthermore, let

$$\hat{\rho}_t = \min_{n=1,\dots,N} \frac{y_n \sum_{r=1}^{t} \alpha_r h_r(x_n)}{\sum_{r=1}^{t} \alpha_r}$$

be the achieved margin in the $t$-th iteration and $\hat{\rho} = \sup_{t=1,2,\dots} \hat{\rho}_t$.
Then the margin $\hat{\rho}$ of the combined hypothesis is bounded from below by

$$\hat{\rho} \ge \frac{\ln(1 - \rho^2) - \ln(1 - (\gamma^{\min})^2)}
{\ln\left( \frac{1 + \gamma^{\min}}{1 - \gamma^{\min}} \right) - \ln\left( \frac{1 + \rho}{1 - \rho} \right)}. \qquad (7)$$

From (7) one can understand the interaction between $\rho$ and $\gamma^{\min}$: If the difference
between $\gamma^{\min}$ and $\rho$ is small, then the rhs of (7) is small. Thus, if $\rho$ with
$\rho \le \gamma^{\min}$ is large, then $\hat{\rho}$ must be large, i.e. choosing a larger $\rho$
results in a larger margin on the training examples. A Taylor expansion of the rhs of (7) shows
that the margin is lower bounded by $\frac{\gamma^{\min} + \rho}{2}$. This known lower bound
(Breiman, 1999, Theorem 7.2) is greater than $\rho$ if $\gamma^{\min} > \rho$.
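
The lower bound (7) is simple to evaluate. The sketch below is illustrative, with a hypothetical function name and a made-up grid of values. It computes the rhs of (7) for a margin parameter `rho` and a smallest edge `gamma_min`, and compares it with the midpoint between the two on a grid with $0 \le \rho < \gamma^{\min} < 1$.

```python
import math

def margin_lower_bound(rho, gamma_min):
    """Right hand side of (7); requires -1 < rho < gamma_min < 1."""
    num = math.log(1 - rho ** 2) - math.log(1 - gamma_min ** 2)
    den = (math.log((1 + gamma_min) / (1 - gamma_min))
           - math.log((1 + rho) / (1 - rho)))
    return num / den

# Grid of (rho, gamma_min) pairs with 0 <= rho < gamma_min <= 0.9.
pairs = [(r / 10, g / 10) for r in range(0, 9) for g in range(r + 1, 10)]
bounds = {(r, g): margin_lower_bound(r, g) for r, g in pairs}

# Original AdaBoost (rho = 0) with smallest edge 0.5.
adaboost_case = margin_lower_bound(0.0, 0.5)
```

On this grid the bound always lies between the midpoint of `rho` and `gamma_min` and `gamma_min` itself. For the original AdaBoost with smallest edge 0.5 it is about 0.26, slightly above half the edge.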




















In Section 4.1 we reasoned that $\gamma^{\min} \le \rho^*$. If the parameter $\rho$ of
AdaBoost$_\rho$ is chosen too small, then we guarantee only that the margin of the produced linear
combination converges asymptotically to a value below $\rho^*$. In the original formulation of
AdaBoost we have $\rho = 0$ and we guarantee only that AdaBoost$_0$ achieves a margin of at least
$\frac{1}{2}\gamma^{\min} = \frac{1}{2}\rho^*$. This shortfall in the margin provable for AdaBoost
motivates our new AdaBoost$^*_\nu$, which is guaranteed to optimize the margin.











4.3.1 EXPERIMENTAL ILLUSTRATION OF COROLLARY 5 














To illustrate the above-mentioned gap, we perform an experiment showing how tight (7) can be. We
analyze two different settings: (i) the weak learner selects the hypothesis with largest edge over all
hypotheses (i.e. the best case) and (ii) the weak learner selects the hypothesis with minimum edge
among all hypotheses with edge larger than $\rho^*$ (i.e. the worst case). Corollary 5 holds for both cases
since the weak learner is allowed to return any hypothesis with edge larger than $\rho^*$.

We use random data with N training examples, where N is drawn uniformly between 10 and
200. The labels are drawn at random from a binomial distribution with equal probability. We use
a hypothesis set with $10^4$ random hypotheses with range {+1, -1}. We first choose a parameter p
uniformly in (0,1). Then the label of each hypothesis on each example is chosen to agree with the
label of the example with probability p.⁵ First we compute the solution $\rho^*$ of the margin-LP problem
via the left hand side of (2). Then we compute the combined hypothesis generated by AdaBoost$_\varrho$
after $10^4$ iterations for $\varrho = 0$ and $\varrho = \frac{1}{3}$ using the best and the worst selection strategy, respectively.
The latter algorithm depends on $\rho^*$. We chose 300 hypothesis sets based on 300 random draws of
p. The random choice of p ensures that there are cases with small and large optimal margins. For
each hypothesis set we did two runs of AdaBoost$_\varrho$ using the best and worst selection strategies. The
result of each run is represented as a point in Figure 1. The abscissa is the maximum achievable
margin $\rho^*$ for each run. The ordinate is the margin of AdaBoost$_\varrho$ using the best (green) and the
worst strategy (red).
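The data generation and the two selection strategies described above can be sketched as follows. This is a simplified illustration under stated assumptions: labels are folded into an agreement matrix `U` with `U[m][n] = y_n * h_m(x_n)` (a representation chosen here for brevity), the maximum margin `rho_star` is assumed to be supplied by a separate LP solver, and all function names are hypothetical.

```python
import random


def make_agreement_matrix(n_examples, n_hypotheses, p, rng):
    """Draw hypotheses that agree with each label with probability p.
    Row m holds y_n * h_m(x_n) in {+1, -1}. Duplicates and hypotheses
    that are correct on all examples are rejected (cf. footnote 5)."""
    if not (0.0 < p < 1.0) or n_hypotheses >= 2 ** n_examples:
        raise ValueError("need 0 < p < 1 and fewer hypotheses than patterns")
    rows, seen = [], set()
    while len(rows) < n_hypotheses:
        row = tuple(1 if rng.random() < p else -1 for _ in range(n_examples))
        if row in seen or all(v == 1 for v in row):
            continue
        seen.add(row)
        rows.append(row)
    return rows


def edges(U, d):
    """Edge of every hypothesis under the distribution d on the examples."""
    return [sum(w * u for w, u in zip(d, row)) for row in U]


def select_best(edge_list):
    """Best case: index of the hypothesis with the largest edge."""
    return max(range(len(edge_list)), key=edge_list.__getitem__)


def select_worst(edge_list, rho_star):
    """Worst case: smallest edge among hypotheses with edge >= rho_star.
    rho_star is assumed to be known, e.g. from solving the margin LP."""
    admissible = [m for m, g in enumerate(edge_list) if g >= rho_star]
    if not admissible:
        raise ValueError("no hypothesis reaches the required edge")
    return min(admissible, key=edge_list.__getitem__)
```

Both selectors only inspect the vector of edges, so they can be plugged into any boosting loop that recomputes the edges after each update of the distribution.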

There is a large difference between these selection strategies. Whereas the margin of the worst
strategy is tightly lower bounded by (7), the best strategy has near maximum margin. These experiments
show that one obtains different results by changing the selection strategy of the weak learning
algorithm. Our lower bound holds for both selection strategies. The looseness of the bounds is indeed
a problem, as we cannot predict where AdaBoost$_\varrho$ converges to.⁶ However, note that moving
$\varrho$ closer to $\rho^*$ reduces the gap (see also Figure 1 [right]).

































































5. We do not allow duplicate hypotheses or hypotheses that agree with the labels on all examples. 
6. One might even be able to construct cases where the outputs are not at all converging.

Figure 1: Achieved margins of AdaBoost$_\varrho$ using the best (green) and the worst (red) selection on random
data for $\varrho = 0$ [left] and $\varrho = \frac{1}{3}$ [right]. On the abscissa is the maximum achievable margin $\rho^*$
and on the ordinate the margin achieved by AdaBoost$_\varrho$ for one data realization. For comparison
we plotted the upper bound y = x and the lower bound (7). On the interval [$\varrho$, 1], there is a clear
gap between the performance of the worst and best selection strategies. The margin of the worst
strategy is very close to the lower bound (7) and the best strategy has near maximum margin. If $\varrho$
is chosen slightly below the maximum achievable margin then this gap is reduced to 0.
































Recently, it has been shown by Rudin et al. (2005) that there exist cases where the weighting $d^t$
on the examples cycles indefinitely between non-optimal solutions. This proves that AdaBoost does
not generally maximize the margin. Furthermore, it was shown in Rudin et al. (2004b) that the gap
exhibited in Figure 1 is not an experimental artifact: under certain conditions the lower bound (7)
was proven to be tight.






































4.3.2 DECREASING THE STEP SIZE 

















Breiman (1999) conjectured that the inability to maximize the margin is due to the fact that the
normalized hypothesis coefficients may “circulate endlessly through the convex set”, which is
defined by the lower bound on the margin. In fact, motivated by our previous experiments, it seems
possible to implement a weak learner that appropriately switches between optimal and worst case
performance, leading to non-convergent normalized hypothesis coefficients.

















Rosset et al. (2002) have shown that AdaBoost with infinitesimally small step sizes may maximize
the margin, if the weak learner uses the best selection strategy. This is similar to what we
found empirically for finite step sizes and motivates us to analyze AdaBoost$_\varrho$ with step sizes chosen
as follows:

$$\alpha_t = \eta \left( \frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t} - \frac{1}{2}\ln\frac{1+\varrho}{1-\varrho} \right)$$

for some $\eta > 0$. For $\eta = 1$ we recover AdaBoost$_\varrho$. Following the same proof technique as for
Corollary 5, we can show that under the same conditions as given in Corollary 5

$$\hat\rho \ge -\frac{1}{\delta} \ln\left( \frac{1}{2}(1+\gamma^{\min})\exp(-\delta) + \frac{1}{2}(1-\gamma^{\min})\exp(\delta) \right),$$

where $\delta = \frac{\eta}{2}\ln\frac{1+\gamma^{\min}}{1-\gamma^{\min}} - \frac{\eta}{2}\ln\frac{1+\varrho}{1-\varrho}$. Note that if $\eta$ goes to zero, then the bound becomes $\hat\rho \ge \gamma^{\min}$. Interestingly, this is
independent of the choice of $\varrho$. Thus if the weak learner always returns hypotheses with edges $\gamma_t \ge \rho^*$
(t = 1,2,...), where $\rho^*$ is the maximum margin, then by the Min-Max Theorem, the margin is
maximized when $\eta$ goes to zero. However, there are no guarantees on the convergence speed.
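The effect of the step-size factor on the margin guarantee can be explored numerically. The sketch below assumes the bound has the form $-\frac{1}{\delta}\ln\left(\frac{1}{2}(1+\gamma)e^{-\delta} + \frac{1}{2}(1-\gamma)e^{\delta}\right)$ with $\delta$ equal to the scaled step size at the minimum edge $\gamma$; this form is a reconstruction from the standard exponential-loss argument, and the function names are illustrative.

```python
import math


def scaled_step(gamma: float, rho: float, eta: float) -> float:
    """Step size eta * (0.5*ln((1+gamma)/(1-gamma)) - 0.5*ln((1+rho)/(1-rho)))."""
    return eta * (0.5 * math.log((1.0 + gamma) / (1.0 - gamma))
                  - 0.5 * math.log((1.0 + rho) / (1.0 - rho)))


def scaled_margin_bound(gamma_min: float, rho: float, eta: float) -> float:
    """Assumed asymptotic margin bound for step sizes scaled by eta,
    evaluated at the minimum edge gamma_min > rho."""
    delta = scaled_step(gamma_min, rho, eta)
    z = (0.5 * (1.0 + gamma_min) * math.exp(-delta)
         + 0.5 * (1.0 - gamma_min) * math.exp(delta))
    return -math.log(z) / delta


if __name__ == "__main__":
    # Shrinking eta moves the bound from the Corollary 5 value towards gamma_min.
    for eta in (1.0, 0.5, 0.1, 0.01):
        print(eta, scaled_margin_bound(0.5, 0.0, eta))
```

With a factor of one the value coincides with the bound of Corollary 5, and for a factor close to zero it approaches the minimum edge regardless of the margin parameter.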


4.4 Convergence of AdaBoost$^*_\nu$


The AdaBoost$^*_\nu$ algorithm is based on two insights:

• According to the discussion after Lemma 3, the most rapid convergence to a combined
hypothesis with margin $\rho^* - \nu$ occurs for AdaBoost$_\varrho$ when one chooses $\varrho$ as close as possible
to $\rho^* - \nu$.

• For distributions on the examples that are hard for the weak learner (i.e. the weak learner
achieves a small edge), the edge $\gamma_t$ will be close to $\rho^*$.

The idea is that by choosing $\varrho_t = (\min_{r=1,\ldots,t} \gamma_r) - \nu$ we concentrate on the hardest distribution we
generated so far and can thus find a close overestimate of $\rho^* - \nu$. This forces an acceleration of the
convergence to a large margin and leads to distributions for which the weak learner has to return
small edges.

Note that if the weak learner always returns hypotheses with edge $\gamma_t = \rho^*$, which is the worst
case under the assumption that $\gamma_t \ge \rho^*$, then $\varrho_t = \rho^* - \nu$ in each iteration. In this case the same
smallest step size is taken in every iteration, which is determined by $\rho^*$ and $\nu$. This smallest step
size decreases with the desired accuracy $\nu$, which matches the intuition from Section 4.3.2 that
decreasing the step size achieves larger and therefore more accurate margins.
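The update of the margin estimate described above can be sketched as a complete boosting loop. This is a simplified sketch under stated assumptions, not a transcription of Algorithm 2: the hypothesis set is finite and given as an agreement matrix `U` with `U[m][n] = y_n * h_m(x_n)`, the weak learner exhaustively picks the largest edge, and the function name is hypothetical.

```python
import math


def adaboost_star(U, nu, n_iterations):
    """Run the margin-estimate scheme rho_t = min_{r<=t} gamma_r - nu and
    return the margin of the normalized combined hypothesis."""
    n_hyp, n_ex = len(U), len(U[0])
    d = [1.0 / n_ex] * n_ex            # distribution on the examples
    alphas = [0.0] * n_hyp             # accumulated hypothesis coefficients
    gamma_min = 1.0                    # running minimum of the observed edges
    for _ in range(n_iterations):
        edge_list = [sum(w * u for w, u in zip(d, row)) for row in U]
        m = max(range(n_hyp), key=edge_list.__getitem__)
        gamma = edge_list[m]
        if gamma >= 1.0:               # a perfect hypothesis has margin one
            return 1.0
        gamma_min = min(gamma_min, gamma)
        rho_t = gamma_min - nu         # current estimate of the target margin
        if rho_t <= -1.0:
            raise ValueError("margin estimate left the interval (-1, 1)")
        alpha = (0.5 * math.log((1.0 + gamma) / (1.0 - gamma))
                 - 0.5 * math.log((1.0 + rho_t) / (1.0 - rho_t)))
        alphas[m] += alpha
        d = [w * math.exp(-alpha * u) for w, u in zip(d, U[m])]
        z = sum(d)
        d = [w / z for w in d]
    total = sum(alphas)
    return min(sum(alphas[m] * U[m][n] for m in range(n_hyp))
               for n in range(n_ex)) / total
```

A small check uses four examples and four hypotheses, where hypothesis m errs only on example m. By symmetry the maximum margin of this set is 0.5, so with precision 0.1 the returned margin should be at least 0.4 once the number of iterations reaches 2 ln N divided by the squared precision.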

We will now state and prove our main theorem: 














Theorem 6 Assume the weak learner always returns a base hypothesis with an edge $\gamma_t \ge \rho^*$. Then
after $\frac{2\ln N}{\nu^2}$ iterations AdaBoost$^*_\nu$ (Algorithm 2) is guaranteed to produce a combined hypothesis f of
margin at least $\rho^* - \nu$.








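Proof Let $\rho = \rho^* - \nu$ be the margin that we would like to achieve. By assumption on the performance
of the weak learner, $\rho^* \le \min_{t=1,\ldots,T} \gamma_t = \gamma_T^{\min}$ and thus $\rho = \rho^* - \nu \le \gamma_T^{\min} - \nu$. In step 3 (d)
of Algorithm 2, $\varrho_t$ was set to $\gamma_t^{\min} - \nu$. Hence $\rho \le \varrho_t$ for each iteration.

Lemmas 2 and 3 imply that

$$\frac{1}{N}\sum_{n=1}^{N} I\left(y_n f(x_n) \le \rho\right) \le \prod_{t=1}^{T} \exp\left( -\frac{1+\rho}{2}\ln\frac{1+\varrho_t}{1+\gamma_t} - \frac{1-\rho}{2}\ln\frac{1-\varrho_t}{1-\gamma_t} \right).$$

We now rewrite the rhs using $\alpha_t = \frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t} - \frac{1}{2}\ln\frac{1+\varrho_t}{1-\varrho_t}$:

$$\prod_{t=1}^{T} \exp\left( \rho\,\alpha_t + \frac{1}{2}\ln\frac{1-\gamma_t^2}{1-\varrho_t^2} \right).$$

By (1), $\alpha_t \ge 0$ since $\varrho_t \le \gamma_t$. By replacing $\rho$ by its upper bound $\varrho_t$ we get:

$$\frac{1}{N}\sum_{n=1}^{N} I\left(y_n f(x_n) \le \rho\right) \le \prod_{t=1}^{T} \exp\left( -\frac{1+\varrho_t}{2}\ln\frac{1+\varrho_t}{1+\gamma_t} - \frac{1-\varrho_t}{2}\ln\frac{1-\varrho_t}{1-\gamma_t} \right).$$

Finally, by (6) and $\gamma_t - \varrho_t \ge \nu$ we have:

$$\prod_{t=1}^{T} \exp\left(-\Delta_2(\varrho_t, \gamma_t)\right) \le \prod_{t=1}^{T} \exp\left(-\frac{\nu^2}{2}\right) = \exp\left(-\frac{T\nu^2}{2}\right).$$

If the right-hand side is smaller than $\frac{1}{N}$, then by the above chain of inequalities, $\frac{1}{N}\sum_{n=1}^{N} I\left(y_n f(x_n) \le \rho\right) < \frac{1}{N}$ and the margin of
each of the N examples is at least $\rho$. The theorem now follows from the fact that $\exp\left(-\frac{1}{2}T\nu^2\right) \le \frac{1}{N}$
if the number of iterations T is at least $\frac{2\ln N}{\nu^2}$.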
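The two quantitative steps of the proof can be checked numerically. The sketch below assumes that the exponent in the product is the binary relative entropy between the Bernoulli parameters $(1+\varrho_t)/2$ and $(1+\gamma_t)/2$, and that the inequality used in the last step is of Pinsker type, i.e. the relative entropy is at least half the squared difference of the two arguments. Function names are illustrative.

```python
import math


def binary_relative_entropy(rho: float, gamma: float) -> float:
    """(1+rho)/2 * ln((1+rho)/(1+gamma)) + (1-rho)/2 * ln((1-rho)/(1-gamma))."""
    p, q = 0.5 * (1.0 + rho), 0.5 * (1.0 + gamma)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def failure_bound(n_iterations: int, nu: float) -> float:
    """Upper bound exp(-T * nu^2 / 2) on the fraction of examples whose
    margin is at most rho* - nu after T iterations."""
    return math.exp(-0.5 * n_iterations * nu ** 2)


if __name__ == "__main__":
    # Pinsker-type step on a grid of values with rho < gamma.
    grid = [i / 10.0 for i in range(-9, 10)]
    ok = all(binary_relative_entropy(r, g) >= 0.5 * (g - r) ** 2 - 1e-12
             for r in grid for g in grid if r < g)
    print("inequality holds on grid:", ok)
    # With N = 100 and nu = 0.1 the bound drops below 1/N at T = 922.
    print(failure_bound(921, 0.1), failure_bound(922, 0.1))
```

The second printout illustrates the final step of the proof: the bound crosses 1/N at the first integer T that is at least 2 ln N divided by the squared precision.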

















If one assumes $\varrho_t \ge 0$, then Theorem 6 could be improved by a factor of $(1 - \varrho_t^2)$ in each iteration,
using the refined upper bound of Corollary 4. Since $\varrho_t \ge \rho^* - \nu$, one would obtain the bound
$\frac{2\ln N}{\nu^2}\left(1 - (\rho^* - \nu)^2\right)$ if $\rho^* \ge \nu$, but this factor will only matter for very large margins.





4.5 Infinite Hypothesis Sets 


So far we have implicitly assumed that the hypothesis space is finite. In this section we will show 
that this assumption is (often) not necessary. Also note, if the output of the hypotheses is discrete, 
the hypothesis space is effectively finite (Rätsch et al., 2002). For infinite hypothesis sets, Theorem 1
can be restated in a weaker form as:















































Theorem 7 (Weak Min-Max, e.g. Nash and Sofer (1996)) 





$$\gamma^* := \min_{d} \sup_{h \in H} \sum_{n=1}^{N} y_n h(x_n) d_n \;\ge\; \sup_{\alpha} \min_{n=1,\ldots,N} y_n \sum_{q:\alpha_q>0} \alpha_q h_q(x_n) =: \rho^*, \qquad (8)$$

where $d \in P^N$, $\alpha \in P^{|H|}$ with finite support.

We call $\Gamma = \gamma^* - \rho^*$ the “duality gap”. In particular, for any $d \in P^N$, $\sup_{h \in H} \sum_{n=1}^{N} y_n h(x_n) d_n \ge \gamma^*$.

In theory the duality gap may be nonzero. However, Lemma 3 and Theorem 6 do not assume
finite hypothesis sets and show that the margin will converge arbitrarily close to $\rho^*$, as long as the
weak learning algorithm can return a hypothesis in each iteration that has an edge not smaller than
$\rho^*$.
In other words, the duality gap may result from the fact that the sup on the left side cannot be
replaced by a max, i.e. there might not exist a single hypothesis h with edge larger or equal to
$\rho^*$. By assuming that the weak learner is always able to pick good enough hypotheses ($\ge \rho^*$), one
automatically gets by Lemma 3 that $\Gamma = 0$.
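For a finite hypothesis set, the weak direction of the min-max relation is easy to verify numerically: for every distribution on the examples and every convex combination of hypotheses, the largest edge is at least the smallest margin. The following sketch checks this on a small toy set; the agreement-matrix representation `U[m][n] = y_n * h_m(x_n)` and all names are assumptions made for illustration.

```python
import random


def max_edge(U, d):
    """Largest edge of any hypothesis under the distribution d."""
    return max(sum(w * u for w, u in zip(d, row)) for row in U)


def min_margin(U, alpha):
    """Smallest margin of any example under the convex combination alpha."""
    n_ex = len(U[0])
    return min(sum(a * row[n] for a, row in zip(alpha, U)) for n in range(n_ex))


def random_simplex_point(k, rng):
    """Random point on the probability simplex (normalized positive weights)."""
    raw = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(raw)
    return [v / s for v in raw]


if __name__ == "__main__":
    # Toy set: hypothesis m is wrong only on example m.
    U = [[-1 if m == n else 1 for n in range(4)] for m in range(4)]
    rng = random.Random(0)
    for _ in range(5):
        d = random_simplex_point(4, rng)
        alpha = random_simplex_point(4, rng)
        print(max_edge(U, d) >= min_margin(U, alpha))
```

On this toy set the uniform distribution and the uniform combination both give the value 0.5, so the two sides of the relation meet and the gap is zero.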
Under certain conditions on H this maximum always exists and strong duality holds (for details 
see e.g. Ratsch et al., 2002; Ratsch, 2001; Hettich and Kortanek, 1993; Nash and Sofer, 1996): 




























































































Theorem 8 (Strong Min-Max) If the set of vectors $\{(h(x_1),\ldots,h(x_N)) \mid h \in H\}$ is compact, then $\Gamma = 0$.

In general, this requirement can be fulfilled by weak learning algorithms whose outputs
continuously depend on the distribution d. Furthermore, the outputs of the hypotheses need to be
bounded (cf. step 3a in AdaBoost$_\varrho$). The first requirement might be a problem with weak learning
algorithms that are variants of decision stumps or decision trees. However, there is a simple
trick to avoid this problem: Roughly speaking, at each point with discontinuity d, one adds all
hypotheses to H that are limit points of $L(S, d^s)$, where $\{d^s\}_{s=1}^{\infty}$ is an arbitrary sequence converging
to d and L(S,d) denotes the hypothesis returned by the weak learning algorithm for distribution d
and training sample S (Rätsch, 2001). This procedure assures that H is closed.

The above theorem is applied in Appendix B to obtain iteration bounds for AdaBoost$^*_\nu$ in the
context of learning a convex combination of support vector kernels.



































5. Experimental Comparison 


In this section we discuss two experiments: The first one shows that our theoretical bounds can be 
tight on artificial data and the second one compares our algorithm to the one proposed in Rudin et al. 
(2004a). 

















5.1 Illustration on Toy Examples 


We are aware that maximizing the margin of the ensemble does not lead to improved generalization
performance in all cases. In fact for fairly noisy data sets the opposite has been reported (cf. Quinlan,
1996; Breiman, 1999; Grove and Schuurmans, 1998; Rätsch et al., 2001). Also, Breiman (1998)
reported an example where the margins of all examples are larger in one ensemble than another and
the latter generalized considerably better.

Figure 2: The two discriminative dimensions of our separable one hundred dimensional data set.


Nonetheless, the theoretical bounds on the generalization error of linear classifiers improve
with the margin. We therefore expect to be able to measure differences in the generalization error
between a function that maximizes the margin and one that does not. Similar results have been
obtained in Schapire et al. (1998) on a multi-class optical character recognition problem.

Here we report experiments on artificial data to illustrate how our algorithm works and how
it compares to AdaBoost. Our data is 100 dimensional and contains 98 nuisance dimensions with
uniform noise. The other two dimensions are shown in Figure 2. For training we use
only 100 examples, which means that controlling the capacity of the ensemble is essential.




















As the weak learning algorithm we use C4.5 decision trees provided by Quinlan (1992) using
an option to control the number of nodes in the tree. We have tuned C4.5 to generate trees with
about three nodes. Otherwise, the weak learner often classifies all training examples correctly and
already over-fits the data. Furthermore, since in this case the margin is already maximum (equal to
1), boosting algorithms would stop since $\gamma = 1$. We therefore need to limit the complexity of the
weak learner, in good agreement with the bounds on the generalization error (Schapire et al., 1998).




















Moreover, we have to deal with the fact that C4.5 cannot use weighted samples. We therefore
use weighted bootstrapping (e.g. Efron and Tibshirani, 1994). However, this amplifies the problem
that the resulting hypotheses might in some cases have an edge smaller than the maximum margin,
which according to the Min-Max-Theorem should not occur if the weak learner performs optimally.
We deal with this problem by repeatedly calling C4.5 on different bootstrap realizations if the edge
is smaller than the margin of the current linear combination. Furthermore, for AdaBoost$^*_\nu$, a small
edge of one hypothesis can spoil the margin estimate $\varrho_t$. We address this problem by resetting
$\varrho_t = \hat\rho_t + \nu$, whenever $\varrho_t < \hat\rho_t$, where $\hat\rho_t$ is the margin of the currently combined hypothesis.
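The weighted bootstrapping with retries described above can be sketched as follows. This is an illustration only: `train` stands for any weak learner that accepts an unweighted sample (it is a hypothetical callback, not the C4.5 interface), and the retry limit `max_tries` is an assumption added to guarantee termination.

```python
import random


def edge(hypothesis, examples, d):
    """Edge sum_n d_n * y_n * h(x_n) of a hypothesis on the weighted examples."""
    return sum(w * y * hypothesis(x) for w, (x, y) in zip(d, examples))


def learn_with_weighted_bootstrap(train, examples, d, current_margin, rng,
                                  max_tries=10):
    """Resample the training set according to d, train on the resample, and
    retry while the edge stays below the margin of the current combination."""
    best_h, best_edge = None, float("-inf")
    for _ in range(max_tries):
        sample = rng.choices(examples, weights=d, k=len(examples))
        h = train(sample)
        g = edge(h, examples, d)        # edge is measured on the weighted set
        if g > best_edge:
            best_h, best_edge = h, g
        if best_edge >= current_margin:
            break
    return best_h, best_edge


if __name__ == "__main__":
    # Hypothetical weak learner: ignores the sample and thresholds at zero.
    def train(sample):
        return lambda x: 1 if x >= 0 else -1

    examples = [(-1.0, -1), (1.0, 1), (2.0, 1)]
    h, g = learn_with_weighted_bootstrap(train, examples, [0.2, 0.3, 0.5],
                                         0.5, random.Random(0))
    print(g)
```

Keeping the best hypothesis over all tries means the function still returns something usable when no resample reaches the required edge.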





























In Figure 3 we see a typical run of AdaBoost, Marginal AdaBoost, AdaBoost$^*_\nu$ and Arc-GV for
$\nu = .1$. For comparison we plot the margins of the hypotheses generated by AdaBoost (cf. Figure 3
(left)). One observes that it is not able to achieve a large margin efficiently. After 1000 iterations
$\hat\rho = .37$.

















Marginal AdaBoost as proposed in Rätsch and Warmuth (2002) proceeds in stages and first tries
to find an estimate of the margin using a binary search. It calls AdaBoost$_\varrho$ three times. The first call
of AdaBoost$_\varrho$ for $\varrho = 0$ stops after four iterations because it has generated a consistent combined
hypothesis. The lower bound l on $\rho^*$ as computed by Marginal AdaBoost is l = .07 and the upper
bound u is .94. The second time $\varrho$ is chosen to be in the middle of the interval [l, u] and AdaBoost$_\varrho$
reaches the margin of $\varrho = .51$ after 80 iterations. The interval is now [.51, .77]. Because the length
of the interval u - l = .27 is small enough, Marginal AdaBoost leaves the loop through an exit
condition, calls AdaBoost$_\varrho$ the last time for $\varrho = u - \nu = .41$ and finally achieves the margin of .55.





In a run of Arc-GV for one thousand iterations we observe a margin of the combined hypothesis of
.53, while for our new algorithm, AdaBoost$^*_\nu$, we find .58. In this case the margin for AdaBoost$^*_\nu$ is
larger than the margins of all other algorithms when executed for one thousand iterations. It starts
with slightly lower margins in the beginning, but then catches up due to the better choice of the margin
estimate.





              C4.5          AdaBoost      Marginal AdaBoost   AdaBoost$^*_\nu$
$E_{gen}$     7.4 ± .11%    4.0 ± .11%    3.6 ± .10%          3.5 ± .10%
$\rho$        -             .31 ± .01     .55 ± .01           .58 ± .01
































Table 2: Estimated generalization performances and margins with confidence intervals for decision
trees (C4.5), AdaBoost, Marginal AdaBoost and AdaBoost$^*_\nu$ on the toy data. All numbers
are averaged over 200 splits into 100 training and 19900 test examples.

Figure 3: Illustration of the achieved margin of AdaBoost$_0$ (left), Marginal AdaBoost (middle),
Arc-GV, and AdaBoost$^*_\nu$ (right) at each iteration. Marginal AdaBoost calls AdaBoost$_\varrho$
three times while adapting $\varrho$ (dash-dotted). We also plot the values for l and u as in
Marginal AdaBoost (dashed). (For details see Rätsch and Warmuth, 2002.) AdaBoost$^*_\nu$
achieves larger margins than AdaBoost. Compared to Arc-GV it starts slower, but then
catches up in the later iterations. Here the correct choice of the parameter $\varrho$ is important.
































In Table 2 we see the average performances of the four classifiers. For AdaBoost and AdaBoost$^*_\nu$
we combined 200 hypotheses for the final prediction. For Marginal AdaBoost we use $\nu = .1$ and let
the algorithm combine only 200 hypotheses for the final prediction to get a fairer comparison.
We see a large improvement of all ensemble methods over the single classifier. There is also a slight,
but (according to a t-test with confidence level 98%) significant difference between the generalization
performances of AdaBoost and Marginal AdaBoost as well as AdaBoost and AdaBoost$^*_\nu$. Note
also that the margins of the combined hypothesis achieved by Marginal AdaBoost and AdaBoost$^*_\nu$
are on average almost twice as large as for AdaBoost. The difference in generalization performance
between AdaBoost$^*_\nu$ and Marginal AdaBoost is not statistically significant.

The differences between the achieved margins of both algorithms seem slightly significant
(96%). The slightly larger margins generated by Marginal AdaBoost can be attributed to the fact that
it uses many more calls to the weak learner than AdaBoost$^*_\nu$ and after an estimate of the achievable
margin is available, it starts optimizing the linear combination using this estimate.

It would be natural to use a two-pass algorithm: In the first pass use AdaBoost$^*_\nu$ to get a margin
estimate $\hat\rho$ of size at least $\rho^* - \nu$ and then use this estimate in a final run of AdaBoost$_\varrho$. The hypothesis
produced in the second pass should have a larger margin and use fewer base hypotheses.


5.2 Heuristics for Tuning the Precision Parameter $\nu$


Our main result says that after $\frac{2\ln N}{\nu^2}$ iterations AdaBoost$^*_\nu$ produces a hypothesis of margin at least
$\rho^* - \nu$. Thus if the algorithm is allowed to run for T iterations, then $\nu$ should be set to $\nu_T = \sqrt{\frac{2\ln N}{T}}$.
If $\nu$ is chosen much larger than $\nu_T$, then after T iterations AdaBoost$^*_\nu$ often achieves a margin below
$\rho^* - \nu_T$. Similarly, if $\nu$ is chosen much smaller than $\nu_T$, then AdaBoost$^*_\nu$ starts too slowly and after
T iterations its margin is typically again below $\rho^* - \nu_T$.

Recently, Rudin et al. (2004a,c) proposed an algorithm, called Coordinate Ascent Boosting,
which solves the same problem as AdaBoost$^*_\nu$. Their analysis of the algorithm shows that it needs
at most $\Omega(\nu^{-3})$ iterations to achieve a margin of at least $\rho^* - \nu$. While this theoretical result is
clearly inferior to the guarantees which we provide for AdaBoost$^*_\nu$, their experimental evaluation
of the algorithms seemed to suggest that the algorithm requires significantly fewer iterations than
AdaBoost$^*_\nu$ in practice. However, their observations were only due to the improper choice of the
accuracy parameter $\nu$ for AdaBoost$^*_\nu$: For $\nu = 10^{-3}$ (as chosen in their study), AdaBoost$^*_\nu$ would
need millions of iterations to achieve a guaranteed margin $\rho^* - \nu$. However, only the first 20K
iterations were displayed and in this range their algorithms achieve a larger margin. For T = 20K
and N = 50, the precision parameter prescribed by our bounds is $\nu_T = .02$. When this parameter is
used, then AdaBoost$^*_\nu$ clearly beats all the other related algorithms (cf. Figure 4). We leave it to the
reader to explore other heuristics for tuning $\nu$ based on the theoretical results of this paper (see also
the discussion at the end of the last subsection).
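The heuristic for the precision parameter amounts to inverting the iteration bound. The sketch below evaluates it for the values quoted above; the function names are illustrative.

```python
import math


def precision_for_budget(n_examples: int, n_iterations: int) -> float:
    """nu_T = sqrt(2 ln N / T): precision matched to a budget of T iterations."""
    return math.sqrt(2.0 * math.log(n_examples) / n_iterations)


def iterations_for_precision(n_examples: int, nu: float) -> int:
    """Smallest integer T with T >= 2 ln N / nu^2."""
    return math.ceil(2.0 * math.log(n_examples) / nu ** 2)


if __name__ == "__main__":
    print(precision_for_budget(50, 20000))      # roughly .02
    print(iterations_for_precision(50, 1e-3))   # several million iterations
```

For N = 50 and a budget of 20K iterations this gives a precision of about .02, while a precision of one thousandth corresponds to a guarantee that only applies after several million iterations.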
Figure 4: AdaBoost$^*_\nu$ with different choices of $\nu$ is compared to Arc-GV and the Coordinate Ascent
Algorithm on the same artificial dataset 1 used in Rudin et al. (2004c) (We reconstructed
this dataset from a figure given in Rudin et al. 2004b): The number of iterations is
T = 20K, the dimension of the examples is N = 50, and we assume that the base learner
returns a hypothesis with maximum edge. If $\nu$ is set to a reasonably close range around
the value $\nu_T = .02$ prescribed by our bound, then AdaBoost$^*_\nu$ achieves a margin which is
significantly larger than the margins achieved by the other algorithms. If $\nu = .001 < \nu_T$
as chosen in Rudin et al. (2004c), then AdaBoost$^*_\nu$ starts too slowly. In the case when the
base learner returns a random hypothesis with edge only at least as large as $\rho^*$, then our
algorithm compares even more favorably (not shown).

6. Conclusion


We have analyzed a generalized version of AdaBoost in the context of large margin algorithms.
From von Neumann’s Min-Max theorem we know that if the weak learner always returns a hypothesis
with weighted classification error less than $\frac{1}{2} - \frac{1}{2}\gamma$, then the maximum achievable margin $\rho^*$ is
at least $\gamma$. The asymptotic analysis led us to a lower bound on the margin of the final hypotheses
generated by AdaBoost$_\varrho$, which was shown to be rather tight in empirical cases. Our results indicate
that vanilla AdaBoost generally does not maximize the margin, and only achieves a margin of about
half the optimum.

To overcome these problems we provided an algorithm AdaBoost$^*_\nu$ with the following provable
guarantees: It produces a linear combination with margin at least $\rho^* - \nu$ and the number of base
hypotheses used in this linear combination is at most $\frac{2\ln N}{\nu^2}$. The new algorithm decreases its estimate
$\varrho$ of the margin iteratively, such that the gap between the best and the worst case becomes
arbitrarily small. Our analysis did not require additional properties of the weak learning algorithm.

In simulation experiments we have illustrated the validity of our theoretical analysis.





Appendix A. Margins 


First recall the definition of margin used in this paper, which is defined for a fixed set of examples
$\{(x_n, y_n) : 1 \le n \le N\}$ and a set of hypotheses $H = \{h_1,\ldots,h_M\}$ (here finite for the sake of
simplicity):

$$\rho^*(H) = \max_{\alpha} \min_{n=1,\ldots,N} y_n \sum_{m=1}^{M} \alpha_m h_m(x_n), \quad \text{where } \alpha \text{ is on the simplex } P^M.$$

Note that we minimize over the margins of individual examples and maximize over the hyperplanes.
Define the one-norm margin $\rho_1^*(H)$ in the same way but now $\alpha$ lies in the larger set $\{\alpha : \alpha \in
R^M \text{ and } \|\alpha\|_1 = 1\}$. It is well known that for a fixed example $(x_n, y_n)$ and normal $\alpha \in R^M$ (with $\|\alpha\|_1 = 1$), the
one-norm margin $y_n \sum_{m=1}^{M} \alpha_m h_m(x_n)$ is the minimum $\ell_\infty$-distance of the example to the hyperplane with normal
$\alpha$ (Mangasarian, 1999; Rätsch et al., 2002), where the latter distance is defined as

$$\inf_{z \in R^M \text{ s.t. } \alpha \cdot z = 0} \; y_n \max_{m=1,\ldots,M} |h_m(x_n) - z_m|.$$


Note that in this appendix, margins are defined as a function of the hypothesis set H because
we will vary this set in a moment. Let cl(H) be the closure of H under negation, i.e. $cl(H) =
H \cup \{-h : h \in H\}$. Now, the following relationships are straightforward:


1. $\rho^*(H) \le \rho_1^*(H)$, $\rho^*(cl(H)) \ge 0$, and $\rho^*(cl(H)) \ge \rho_1^*(H)$.
2. If $\rho^*(cl(H)) > 0$, then $\rho^*(cl(H)) = \rho_1^*(H)$.
3. If $\rho_1^*(H) \ge 0$, then $\rho^*(cl(H)) = \rho_1^*(H)$.


In summary, if the one-norm margin of H is non-negative, then the margin of the closed hypotheses
class cl(H) coincides with the one-norm margin.

Appendix B. An Application to Multiple Kernel Learning























Sonnenburg et al. (2005) proposed a new algorithm for solving the multiple kernel learning (MKL)
problem that was introduced in Lanckriet et al. (2004); Bach et al. (2004). The idea of MKL is to
find a convex combination of J support vector kernels $k_j : X \times X \to R$ (j = 1,...,J) that maximizes
the SVM soft margin (cf. Bach et al. (2004)). In Sonnenburg et al. (2005) the original quadratically
constrained quadratic program was reformulated to the following semi-infinite linear program:






























































$$\min_{\beta \in P^J} \; \sup_{\alpha \in A} \; \sum_{j=1}^{J} \beta_j S_j(\alpha), \qquad (9)$$

where

$$S_j(\alpha) := -\frac{1}{2}\sum_{r,s=1}^{N} \alpha_r \alpha_s y_r y_s k_j(x_r, x_s) + \sum_{n=1}^{N} \alpha_n, \qquad A := \Big\{\alpha \in R^N : 0 \le \alpha_n \le C, \; \sum_{n=1}^{N} \alpha_n y_n = 0\Big\},$$

and C is the SVM regularization constant. Note that this problem has infinitely many constraints:
one for every vector $\alpha$ in its domain A. Note that problem (9) is of the same type as the semi-infinite
programming problem (8), which can be solved with AdaBoost$^*_\nu$ (cf. discussion in Section 4.5).
Since the $S_j(\alpha)$ are continuous functions and A is compact, it follows from Theorem 8 that the
duality gap is zero.

When AdaBoost$^*_\nu$ is applied to this problem, a hypothesis with large edge has to be found in
each iteration. In this case the hypotheses are $\alpha$ vectors and the edge is

$$\sum_{j=1}^{J} \beta_j S_j(\alpha) = -\frac{1}{2}\sum_{r,s=1}^{N} \alpha_r \alpha_s y_r y_s \Big(\sum_{j=1}^{J} \beta_j k_j(x_r, x_s)\Big) + \sum_{r=1}^{N} \alpha_r.$$

It has been noted that the edge in this case is nothing else than the negative SVM objective function
for the combined kernel $k(x_r, x_s) = \sum_{j=1}^{J} \beta_j k_j(x_r, x_s)$. Hence, identifying an $\alpha$ vector with maximum
edge amounts to solving the standard SVM dual optimization problem. Fortunately, many efficient
SVM packages are available to solve this problem. Thus, the MKL problem can be efficiently
solved using AdaBoost$^*_\nu$, and our iteration bound for AdaBoost$^*_\nu$ is applicable.
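The identity between the edge and the objective for the combined kernel is a consequence of linearity in the kernel weights and can be checked on a toy instance. The sketch below assumes $S_j(\alpha) = \sum_n \alpha_n - \frac{1}{2}\sum_{r,s} \alpha_r \alpha_s y_r y_s k_j(x_r, x_s)$ with kernels given as precomputed Gram matrices; the names and the example matrices are illustrative.

```python
def s_value(alpha, y, gram):
    """S(alpha) = sum_n alpha_n - 0.5 * sum_{r,s} alpha_r alpha_s y_r y_s K[r][s]."""
    n = len(alpha)
    quad = sum(alpha[r] * alpha[s] * y[r] * y[s] * gram[r][s]
               for r in range(n) for s in range(n))
    return sum(alpha) - 0.5 * quad


def mkl_edge(beta, alpha, y, grams):
    """Edge of the 'hypothesis' alpha under the kernel weighting beta."""
    return sum(b * s_value(alpha, y, gram) for b, gram in zip(beta, grams))


def combined_gram(beta, grams):
    """Gram matrix of the convex combination sum_j beta_j * k_j."""
    n = len(grams[0])
    return [[sum(b * gram[r][s] for b, gram in zip(beta, grams))
             for s in range(n)] for r in range(n)]


if __name__ == "__main__":
    grams = [[[1.0, 0.0], [0.0, 1.0]],      # toy Gram matrix of kernel 1
             [[2.0, 1.0], [1.0, 2.0]]]      # toy Gram matrix of kernel 2
    beta = [0.25, 0.75]                     # kernel weights on the simplex
    alpha, y = [0.5, 0.5], [1, -1]          # satisfies sum_n alpha_n y_n = 0
    print(mkl_edge(beta, alpha, y, grams),
          s_value(alpha, y, combined_gram(beta, grams)))
```

Both printed numbers agree, which is why maximizing the edge over the dual variables can be delegated to an SVM solver run on the combined kernel.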